Markov chains improve the significance computation of overlapping genome annotations

Abstract Motivation Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome. Results We show that finding the P-values under the typically used ‘gold’ null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool MCDP that computes the P-values under the Markovian null hypothesis in O(m2+n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent from the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy. Availability and implementation The software is available at https://github.com/fmfi-compbio/mc-overlaps. All data for reproducibility are available at https://github.com/fmfi-compbio/mc-overlaps-reproducibility. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
A genome annotation is a fundamental component of biological analysis. Mathematically, it can be viewed as a set of genomic intervals with some given characteristic. For example, it is used to represent repeat elements (Jurka, 2000), genes (Venter et al., 2001) or non-coding RNAs (Bartel, 2009). Genome annotations are often compared against each other to determine whether the underlying processes generating them are related; this is sometimes referred to as colocalization analysis. For example, if one annotation contains the nucleosomes with H3K4Me3 histone modifications and another contains known promoter regions, then we can count the nucleosome regions that overlap at least one promoter region. A large number of overlapping regions would suggest that transcription initiation is correlated with the presence of H3K4Me3 histones (Guenther et al., 2007). A robust measure of statistical significance becomes important in order to ascertain whether the overlaps are due to chance or due to a true association.
There have been several approaches to colocolization analysis, surveyed by Dozmorov (2017) and Kanduri et al. (2019). Kanduri et al. (2019) characterize these methods by (i) the choice of the colocalization statistic, e.g. the number of overlapping intervals (Layer et al., 2013), the number of shared bases (Zarrei et al., 2015) or the distances between the closest elements (Chikina and Troyanskaya, 2012), (ii) the choice of the null hypothesis, e.g. permutational (Coarfa et al., 2014) or with fixed interval positions (Sheffield and Bock, 2016) and (iii) the algorithm to compute the Pvalue, e.g. an analytical (McLean et al., 2010) or a sampling (Yu et al., 2015) algorithm.
A natural and popular choice of the null hypothesis (which we call the gold standard, or H GS 0 ) is that one of the two compared annotations (called query) is drawn uniformly at random from the set of all possible non-overlapping rearrangements of the query's intervals, whereas the other annotation (called reference) is fixed. The gold standard and its extensions take into account the lengths of intervals and are more flexible than approaches such as those that fix interval positions (Kanduri et al., 2019). However, null hypotheses of such type were, until recently, only used with P-value algorithms based on rejection sampling (Kanduri et al., 2019). The drawbacks of rejection sampling are that it cannot accurately estimate small P-values and is too slow when the query has high coverage or a high number of intervals. A major advance was recently made by Sarmashghi and Bafna (2019), who gave two analytical (non-sampling) P-value algorithms under slight modifications to H GS 0 , showing a potential way to reap the benefits of H GS 0 while overcoming its computational challenges.
The most promising algorithm of Sarmashghi and Bafna (2019), which we refer to as SBDP, assumes a modified H GS 0 which is forced to maintain the order of intervals in the query. SBDP computes the exact P-value under this modified H GS 0 in time OðmnLÞ and memory OðmLÞ, where m and n are the numbers of intervals in the reference and query, respectively, L is the length of the chromosome, and 0 < 1 is a scaling factor. The scaling factor trades off speed/memory for accuracy by shrinking the chromosome and all the intervals by a factor of . Experiments show that SBDP tends to be accurate with respect to H GS 0 when ¼ 1, in spite of the order restriction. However, the running time is proportional to L, making SBDP only pseudo-polynomial and requiring substantial scaling to run on human annotations. Such scaling can have a detrimental affect on the accuracy and result in biased P-values (Sarmashghi and Bafna, 2019). Overall, though SBDP has for the first time allowed analytically computing P-values under H GS 0 in many smaller instances, a method that is fast and accurate for the human genome has remained elusive.
In this article, we first show that calculating the P-value under H GS 0 is N P-hard (Section 3), which explains the lack of algorithms that are both efficient and accurate and motivates the need for alternate null hypotheses. In order to test the extent to which alternate null hypotheses differ from H GS 0 , we give an algorithm for sampling under H GS 0 in Oðn log LÞ time per sample (Section 4). We are not aware of previous descriptions of a non-rejection-based sampling scheme. Then, we propose an alternate null hypothesis based on Markov chains (Section 5). Our Markov chain null hypothesis (H MC 0 ) is that the query is generated by a Markov chain with two states, corresponding to 'outside' and 'inside' an interval, with transition probabilities that maximize the likelihood of observing the query Q. Under H MC 0 , the asymptotically expected lengths of the intervals and the gaps between them are equal to average lengths of the intervals and gaps in Q, respectively. We then give an algorithm MCDP to compute the exact P-value in time Oðm 2 þ nÞ and memory OðmÞ (Section 6), which avoids any dependence on the genome length L.
Our experimental results on both synthetic and real datasets show that MCDP is generally accurate with respect to H GS 0 while using negligible memory and running efficiently on human-scale datasets (Section 7). MCDP outperforms SBDP in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy. In our largest test case, we compared the set of all human exons (217 527 intervals) with the set of all known copy number losses (23 438 intervals), which is a nearly 10-fold increase in size in comparison to H3K4me3, the largest evaluation dataset of Sarmashghi and Bafna (2019). MCDP ran in <17 h and used less than half a gigabyte of RAM, while SBDP exceeded the memory capacity of our server ($100 GB) when used with ¼ 10 À3 . Furthermore, we believe that the flexibility and power of our Markov chain approach allow modifications to the gold standard null hypothesis that address many of the biological use-cases described by Kanduri et al. (2019).

Preliminaries
Let L > 0 be an integer which we refer to as the genome length. Given two positions x 2 f0; . . . ; Lg and y 2 f0; . . . ; Lg, with y > x, the interval ½x; yÞ refers to the interval between x and y -1, including the endpoints. This type of interval is sometimes said to be halfopen and to be contained in ½0; LÞ. The length of an interval is yx. The weight of a set of intervals is the sum of their lengths. Let A be a set of non-intersecting intervals contained in ½0; LÞ. We let lens(A) denote the multiset of lengths of intervals in A. We let gaps(A) denote the sequence of lengths of the gaps between adjacent intervals of A, starting from the beginning and up to the end of the genome. For example, if L ¼ 10 and A ¼ f½3; 4Þ; ½5; 6Þ; ½7; 9Þg, then lensðAÞ ¼ f1; 1; 2g and gapsðAÞ ¼ ð3; 1; 1; 1Þ. Given two sets of intervals A and A 0 , we let KðA; A 0 Þ denote the number of intervals in A that intersect some interval in A 0 . We say that a non-intersecting set of intervals is separated if all the gaps, except the first and the last, are at least 1. We formally define an annotation to be a set of nonintersecting intervals that is separated and contained in ½0; LÞ.
3 The P-value problem and the gold standard null In this section, we define the P-value problem and the gold standard null hypothesis (H GS 0 ) and then show that the P-value problem under H GS 0 is N P-hard. Let R (reference) and Q (query) be two annotations. We assume that R is sorted. Let H 0 be a null hypothesis for the distribution of Q. The P-value problem for the overlap between genome annotations is to output the sequence ðp 0 ; . . . ; p jRj Þ, where, for 0 k jRj, The gold standard null hypothesis (H GS 0 ) generates an annotation Q rand uniformly at random from a sample space of all annotations such that lensðQ rand Þ ¼ lensðQÞ. We now show that the problem of computing P-values for the overlap between genome annotations under the gold standard null hypothesis is N P-hard. This helps explain the lack of a polynomial time algorithm and motivates us to consider an alternate null hypothesis. THEOREM 1. The P-value problem for the overlap between genome annotations under the gold standard null hypothesis is N P-hard. PROOF. We proceed by a reduction from the N P-complete multiprocessor scheduling problem (Garey and Johnson, 1979). The multiprocessor scheduling problem takes a multiset of N tasks with positive integer lengths a ! ¼ fa 1 ; . . . ; a N g and positive integers w and b ! 2 and decides whether it is possible to partition the tasks into b (some possibly empty) batches, such that the sum of lengths for each batch is less or equal to w. Let a ! , b and w be an instance of the multiprocessor scheduling problem. Without loss of generality, we assume that all lengths in a ! are at least 2. This can be ensured by multiplying w and all input lengths by 2. Furthermore, without loss of generality, we assume that wb ! P i a i , since otherwise the scheduling problem trivially has no solution.
From this instance of the scheduling problem, we define an instance of the P-value problem under the H GS 0 . We set the genome length to be L :¼ bðw þ 1Þ À 2 and define the reference annotation as having b -1 intervals with interval i being ½iðw þ 1Þ À 2; iðw þ 1ÞÞ, for 1 i b À 1. Conceptually, this partitions the genome into b equally sized empty regions, separated by b -1 reference intervals. The length of each empty region is w -1 and of each reference interval is 2. The query annotation Q is defined as N intervals whose lengths are given by subtracting 1 from each of the task lengths, i.e. lensðQÞ ¼ fa 1 À 1; . . . ; a N À 1g. Note that the locations of these intervals is irrelevant for the P-value problem under H GS 0 . Figure 1 depicts an example of the construction. We claim that for any annotation Q 0 with lensðQ 0 Þ ¼ lensðQÞ, there exists a solution to the scheduling problem iff KðR; Q 0 Þ ¼ 0. The scheduling instance has a ! ¼ f2; 2; 2; 2; 4; 7g, N ¼ 6, b ¼ 3 and w ¼ 7.
The resulting instance of the P-value problem has the length of the genome as L ¼ 22, the reference interval set as R ¼ f½6; 8Þ; ½14; 16Þg, and the multiset of query interval lengths as lensðQÞ ¼ f1; 1; 1; 1; 3; 6g. The figure shows R and an annotation Q 0 with lensðQ 0 Þ ¼ lensðQÞ and no overlap with the reference (i.e. KðR; Q 0 Þ ¼ 0). The annotation Q 0 corresponds to partitioning a ! as ff4; 2g; f7g; f2; 2; 2gg, which is a valid solution to the scheduling problem instance i204 A.Gafurov et al.
For the if direction, suppose the elements of a ! are partitioned into b batches with the sum of lengths in each batch at most w. If we take all the intervals corresponding to a batch of tasks and place them side by side, with gaps of one between them, their total span on the genome will be at most w -1. We can then define Q 0 so that each of the b batches fits into one of the b empty regions, thereby assuring that KðR; Q 0 Þ ¼ 0. For the only if direction, assume that there exists a Q 0 with KðR; Q 0 Þ ¼ 0. If an empty region contains x intervals, then, due to the gap of at least one between adjacent intervals, the sum of their lengths is at most wx. Given the definition on the lengths of the intervals in Q 0 , the tasks corresponding to those intervals would have length of at most w and could fit into one batch.
Since there are b empty intervals, this implies a partitioning of the tasks into b batches, which satisfies the scheduling problem. Let ðp 0 ; p 1 ; . . . ; p m Þ be the solution of the P-value problem under H GS 0 for Q. By definition, p 0 À p 1 is the probability under H GS 0 of generating a set of intervals Q rand with KðR; Q rand Þ ¼ 0. We observe that p 0 > p 1 iff there exists at least one interval set Q rand with KðR; Q rand Þ ¼ 0. Combining with our reduction, p 0 > p 1 iff the scheduling problem has a solution. To wrap up, we have given a reduction, computable in polynomial time, that shows that if there is a polynomial time algorithm for the P-value under H GS 0 , then there is a polynomial time algorithm for the scheduling problem. Since the scheduling problem is N P-complete, this implies that the P-value problem under H GS 0 is N P-hard.

0
In this section we present Algorithm 1 which samples annotations from the gold standard null distribution (H GS 0 ) in OðjQj log LÞ time per sample. A fast-sampling algorithm serves at least two purposes. First, when the P-value is not too small, the algorithm can approximate it with a reasonable number of samples. Second, even if the Pvalue is small, the algorithm can provide the ground truth for comparing the fidelity of an alternate null hypothesis to H GS 0 . If the weight of the query is low, rejection sampling (Devroye, 1986) can be used. This scheme places the start of each interval uniformly at random; if the resulting set of intervals is non-intersecting and separated, the sample is kept; if not, it is rejected. However, when the weight of the query is high, rejection dominates and the number of attempts needed is prohibitively high. We therefore pursue a rejection-free approach.Algorithm 1 defines Q rand (Line 10) by first choosing the order O in which the required interval lengths appear (Line 1) and then setting the lengths of the jQj þ 1 gaps (i.e. Gaps½0 . . . jQj) such that their sum is L À weightðQÞ (Lines 2-9). To force Q rand to be separated, it initializes each gap to be 1, except the first and last gap (Line 2). That leaves L À weightðQÞ À jQj þ 1 of free space to distribute among the jQj þ 1 gaps (Line 3). It does this by sequentially processing all the gaps and, for each gap, sampling an integer x between 0 and the amount of remaining free space, increasing the gap by x, and decreasing the remaining free space by x (lines 5-7). To show that every Q rand is chosen with equal probability and in the necessary time, we will prove the following theorem: Before proving Theorem 2, we motivate the need for using the Beta-Binomial distribution to distribute the free space by considering two alternative ideas that are simpler but do not work. First, we could choose the free space for each gap by sampling uniformly from the amount of remaining free space. To see that this approach leads to unequal probabilities for different annotation, consider the example of lensðQÞ ¼ f1; 1g and L ¼ 10. The probability of sampling the annotation f½3; 4Þ; ½9; 10Þg is 1 8Á5 , whereas the probability of sampling the annotation f½7; 8Þ; ½9; 10Þg is 1 8Á1 Second, we could distribute the free space by taking a sample from the multinomial distribution with a number of trials equal to the free space, jQj þ 1 categories, and category probabilities equal to 1 jQjþ1 . This approach also leads to non-uniform sampling of annotations; i.e. for the above example, the probability of sampling the two annotations is 7! 3!4!0! Á 1 3 7 % 0:016 and 7! 7!0!0! Á 1 3 7 % 0:0004, respectively.
PROOF OF THEOREM 2. We have already argued that Algorithm 1 outputs an annotation. To prove the theorem, we first prove that the output Q rand is chosen uniformly at random from all annotations with lensðQ rand Þ ¼ lensðQÞ and, second, that Algorithm 1 runs in time OðjQj log LÞ.
Distributing the free space among the jQj þ 1 gaps can be restated as finding a b-partition. A b-partition of a non-negative integer U is an ordered b-tuple ðu 1 ; . . . ; u b Þ of non-negative integers whose sum is equal to U. In our case, b ¼ jQj þ 1 and U is the free space. Note that for every permutation of the interval lengths, the number of annotations Q rand such that lensðQ rand Þ ¼ lensðQÞ is identical and equal to the number of ðjQj þ 1Þ-partitions of the free space. Therefore, to sample Q rand uniformly, it suffices to first sample the permutation uniformly and then sample the b-partition uniformly. It remains to show that Algorithm 1 chooses a b-partition uniformly at random.
To count the total number of b-partitions of U, we can think of inserting b -1 separators into a number line of length U, resulting in a sequence of U þ b À 1 slots with breakers assigned to b -1 of those slots. The number of partitions is thus the number of ways to choose b -1 slots out of U þ b À 1, which we denote as It is useful to redefine a b-partition of U recursively as follows.
. . . ; Ug and ðu 2 ; . . . ; u b Þ is a ðb À 1Þ-partition of U À u 1 . Observe that the number of b-partitions with u 1 ¼ i is SðU À i; b À 1Þ. Therefore, to sample a b-partition uniformly in this recursive manner, we need to choose a value of u 1 such that the chosen value is i with probability SðUÀi;bÀ1Þ SðU;bÞ . We can equivalently rewrite this (the algebraic derivation is in the Supplementary Section S1) as sampling u 1 from the following cumulative distribution function: This distribution on u 1 is the Beta-Binomial distribution with parameters a ¼ 1, b ¼ b À 1, and U trials. Sampling from this Algorithm 1 Sample from H GS 0 Input: Genome length L and an annotation Q Output: An annotation Q rand , drawn from H GS 0 Notation: BBðt; a; bÞ is the Beta-Binomial distribution with parameters t (i.e. the number of trials), a and b. It gives an integer between 0 and t. 1: O½1 . . . jQj a permutation of lens(Q) chosen uniformly at random 2: Gaps½0 . . . jQj ð0; 1; . . . ; 1; 0Þ 3: freeSpaceRemaining L À weightðQÞ À jQj þ 1 4: for i ¼ 0 to jQj À 1 do 5: x sample from BBðfreeSpaceRemaining; 1; jQj À iÞ 6: Gaps½i Gaps½i þ x 7: freeSpaceRemaining freeSpaceRemaining À x 8: end for 9: Gaps½jQj freeSpaceRemaining 10: Q rand annotation defined by placing intervals with lengths given by O onto the genome, with gaps between them defined by Gaps. 11: return Q rand distribution is indeed what the algorithm does (Line 5), thus the algorithm samples a jQj þ 1-partition of free space uniformly at random.
To prove the runtime of the algorithm, first observe that uniformly sampling a permutation of jQj elements (Line 1), initializing Gaps (Line 2), and constructing Q rand (Line 10) can be done trivially in OðjQjÞ time. All other operations besides sampling from the Beta-Binomial distribution (Line 5) can be done in constant time. Thus, the total runtime is OðjQjÞ multiplied by the time for sampling from the Beta-Binomial distribution. This can be done using the technique of inverse transform sampling, using a binary search to compute the inverse transform [as described in Devroye (1986)]. Briefly, the technique works by first choosing a real number y uniformly between 0 and 1 and then finding the smallest i such that y Pr½u 1 i. The search for i can be done in Oðlog UÞ time using binary search, because the cumulative distribution function is monotone and can be evaluated in constant time using Equation (1). Thus, the total runtime of the algorithm is OðjQj log UÞ ¼ OðjQj log LÞ.

The Markov chain null hypothesis (H MC 0 )
Given the challenges of computing exact P-values under H GS 0 (Sarmashghi and Bafna, 2019) and the fact that doing so is N Phard in the general case (Theorem 1), we turn to a more tractable yet still faithful null hypothesis. We are inspired by the somewhat parallel problem of generating random strings that have desired nucleotide frequencies (Robin et al., 2005). In what is called the permutation model, the nucleotide frequencies in the generated strings must be exactly the desired ones, while in the more relaxed Markov chain model, the nucleotide frequencies in the generated strings must be the desired ones only in expectation. Many quantities turn out to be much simpler to compute in the Markov chain model than in the permutation model, making the relaxation a good tradeoff in many instances (Robin et al., 2005). We will show that the same idea carries over to our problem as well. An annotation can be generated using a two-state Markov chain as follows. The two states 0 and 1 correspond to being outside of an interval and inside an interval, respectively. The transition probabilities are given by T ¼ ð t 00 t 10 t 01 t 11 Þ, where, where t ij is the probability of transitioning from state i to state j. Let p ! ¼ p 0 ; . . . ; p LÀ1 be the states of this Markov chain, after L steps. For simplicity, the initial state p 0 and the last state p LÀ1 are assumed to be 0. This sequence of states induces an interval set by combining all contiguous stretches of 1 states. For example, a sequence of states ð0; 0; 0; 1; 1; 1; 0; 1; 1; 0; 1; 0Þ would produce an interval set f½3; 6Þ; ½7; 9Þ; ½10; 11Þg. The Markov chain null hypothesis (H MC 0 ) is that the query is generated by this Markov chain with parameters Using a standard frequency counting approach (Koller and Friedman, 2009), these are the weights that maximize the probability of observing the given query set Q. Moreover, the expected length of the intervals and the gaps under H MC 0 is asymptotically (with growing L and n) equal to the average lengths of the intervals and the gaps in Q, respectively (Koller and Friedman, 2009 In this section, we give an algorithm (which we call MCDP) for the P-value problem under H MC 0 . In particular, given a genome length L, a reference annotation R, and a query annotation Q, the problem is to output ðp 0 ; . . . ; p m Þ, where: We first need some definitions. Recall that n ¼ jQj and m ¼ jRj. Let R ¼ f½b 1 ; e 1 Þ; . . . ; ½b m ; e m Þg be the reference interval set. Let us also define e 0 :¼ 1 (note we set e 0 ¼ 1 in order to accommodate for the fact p 0 is always 0 in our definition) and b mþ1 :¼ L. Let g j :¼ b j À e jÀ1 , for 1 j m þ 1, and let l j :¼ e j À b j , for 1 j m. These are the lengths of the gaps and the intervals in R, respectively. We define Z any ðiti 0 Þ as the probability of being in state i 0 given that t positions earlier the state was i. We define Z zeros ðiti 0 Þ as the probability that the next t states are all zero, with the last of these states being i 0 , conditioned on the current state being i. Naturally, if i 0 ¼ 1 and t ! 1, then this probability is 0. We postpone the computation of Z any and Z zeros until Section 6.2.

Dynamic programming: simple form
Our algorithm for computing the P-value under H MC 0 is based on dynamic programming, building upon some of the ideas of the algorithm of Sarmashghi and Bafna (2019). It builds a table of entries P DP ½j; j; s, for 0 j m; 0 j m and s 2 f0; 1g. P DP ½j; j; s denotes the probability of hitting (i.e. intersecting) exactly j reference intervals after generating e j states of the Markov chain (i.e. finishing at the last base of jth reference interval, which is e j À 1), with the last generated state being s. From this table, we can compute p k as follows: The base cases of P DP follow from definitions: P DP ½j; j; s ¼ 1 when j ¼ j ¼ s ¼ 0 and P DP ½j; j; s ¼ 0 when either 1) j ¼ j ¼ 0 and s ¼ 1, 2) j < j, or 3) j < 0. The general case recurrence, which we explain below, is given by: P DP ½j; j; s ¼ P DP ½j À 1; j; 0P nohit ðj; 0; sÞ þP DP ½j À 1; j; 1P nohit ðj; 1; sÞ þP DP ½j À 1; j À 1; 0P hit ðj; 0; sÞ þP DP ½j À 1; j À 1; 1P hit ðj; 1; sÞ: (2) To understand this recurrence, we need to define the functions P nohit and P hit . We define P nohit ðj; i; i 0 Þ as the probability that p ! does not hit the jth reference interval and is in state i 0 at position e j À 1, given that the state at position e jÀ1 À 1 is i. In other words, it is the probability that p x ¼ 0 for b j x < e j and p ejÀ1 ¼ i 0 , given that p ejÀ1À1 ¼ i. This can happen in one of two ways: either p bjÀ1 is zero or one. In either case, we need to take g j steps to transition from i to p bjÀ1 , using any intermediate states, and then l j steps to transition from p bjÀ1 to i 0 , using only zero states. Thus, In a parallel fashion, we define P hit ðj; i; i 0 Þ as the probability that p ! intersects the jth reference interval and is in state i 0 at position e j À 1, given that the state at position e jÀ1 À 1 is i. In other words, it is the probability that p ejÀ1 ¼ i 0 and there exists x 2 fb j ; . . . ; e j À 1g such that p x ¼ 1, given that p ejÀ1À1 ¼ i. This can be calculated as the probability of going from i to i 0 using any states minus the probability of going from i to i 0 in a way that does not hit the jth interval. Thus, P hit ðj; i; i 0 Þ ¼ Z any ði ! gjþlj i 0 Þ À P nohit ðj; i; i 0 Þ Now we can justify the general dynamic programming recurrence (Equation 2). There are two ways that we can hit exactly j of the first j reference intervals. Either, we have hit j of the first j -1 intervals and then do not hit the jth interval, or we have hit j À 1 of the first j -1 intervals and then hit the jth interval.

Dynamic programming: efficient formulation
Computing the simple dynamic programming table given by Equation (2) reduces to repeatedly evaluating Z any and Z zeros . In this subsection, we show how to efficiently do this by reformulating the dynamic programming table using matrix notation.
The probability given by Z any ði ! a i 0 Þ can be written compactly by taking the value in row i and column j of the matrix obtained by raising T to the ath power (Norris, 1998). In other words, Z any ði ! a i 0 Þ ¼ ðT a Þ ii 0 : For Z zeros ði ! a i 0 Þ, we must modify the transition matrix to forbid any transitions into the 1 state by setting those probabilities to zero. That is, we define T mod :¼ ð t 00 0 t 10 0 Þ and Z zeros ði ! a i 0 Þ ¼ ðT mod a Þ ii 0 : Using these definitions of Z, we can more naturally express P nohit using a dot product, i.e. P nohit ðj; i; i 0 Þ ¼ ðT gj Á T mod lj Þ ii 0 . Similarly, we can ii 0 : Given that we can express P nohit and P hit using matrix operations, it becomes more natural to view the dynamic programming table P DP as a 2D table P DP2 ½j; j with vectors of length 2 as its cells. The base cases of this table are given by: P DP2 ½j; j ¼ ð 1 0Þ when j ¼ j ¼ 0, and P DP2 ½j; j ¼ ð 0 0Þ when j < j or j < 0. The general case is now given by: To efficiently compute the required matrix exponentiation, we observe that T and T mod are both diagonizable. This allows them to be exponentiated in just two matrix multiplications (Margalit and Rabinoff, 2017): This calculation takes constant time, since our matrices have constant size (i.e. 2 rows by 2 columns). To reduce computational overhead when the exponent is small (a 50), we use the simpler exponentiation by squaring technique (Gordon, 1998).
Finally, we have that the time complexity of the dynamic programming is Oðm 2 Þ, since each cell takes Oð1Þ time to compute. Combined with the OðnÞ time to compute the parameters of the Markov chain, the total time complexity of our algorithm for computing P-values under H MC 0 is Oðm 2 þ nÞ. The memory complexity is OðmÞ, since only the last two rows and the last column are needed to be held in memory.

Evaluated algorithms and metrics
In order to examine the performance of MCDP, we compare SBDP, the order-restricted dynamic programming algorithm of Sarmashghi and Bafna (2019). We do not compare to the Poisson Binomial Approximation algorithm of Sarmashghi and Bafna (2019), since they show it to be substantially less accurate in practice. We also do not compare with alternate approaches that ignore interval lengths, since they were shown to be inaccurate in many cases (Sarmashghi and Bafna, 2019). To evaluate accuracy with respect to H GS 0 , we use Algorithm 1 with 10 000 samples. We conducted experiments on a server with Intel V R Xeon V R CPU E5-2680 v3 at 2.50 GHz Â 48 and 96 GB DDR4 RAM, using a single thread.

Datasets
Synthetic data: To simulate a chromosome annotation, we have parameters specifying the chromosome length, the number of intervals and the desired coverage. Coverage is the ratio of the sum of the interval lengths to the chromosome length. All interval lengths are identical and set to the coverage times the chromosome length divided by the number of intervals. The locations of the intervals are chosen uniformly at random using Algorithm 1. We chose to have fixed-length intervals in order to limit the dimension of our evaluation space; however, we note that this choice does not favor MCDP, since H MC 0 actually gives a geometric distribution of interval lengths.
Real datasets from Sarmashghi and Bafna (2019): Sarmashghi and Bafna (2019) evaluated SBDP on four real datasets which cover diverse biological applications, summarized in Table 1. All datasets use the multi-chromosome hg19 human genome assembly. In order to run MCDP, we merged any intersecting or non-separate intervals. Note that we combine the P-value solution for individual chromosomes into a P-value solution for multiple-chromosomes using the post-processing algorithm described in Sarmashghi and Bafna (2019); it runs in time OðNm 2 Þ and memory OðNmÞ, where N is the number of chromosomes.
The first dataset (labeled EC) uses copy number amplifications that are recurrent in The Cancer Genome Atlas as the reference annotation and regions enriched for extra-chromosomal DNA sequence as the query annotation (Turner et al., 2017). In the second dataset (labeled CS), the reference annotation is the set of regions that were assigned a certain chromatin state (9) by a computational study of Ernst and Kellis (2010) and the query annotation is the set of promoter regions obtained from RefSeq (Sarmashghi and Bafna, 2019). The third dataset (labeled CNV) has the set of all non-coding genes as the reference annotation and the set of all regions with copy number gains as the query annotation (Zarrei et al., 2015). The fourth dataset (labeled H3K4me3) has the set of promoter regions as the reference annotation and regions highly enriched for H3K4me3 histone modification as the query annotation (Guenther et al., 2007). Sarmashghi and Bafna (2019) showed that the SBDP P-values in these datasets are significant, indicating an association between two biological mechanisms and suggesting future directions of study. Although we will show that MCDP gives different P-values, they will still be significant. We focus our analysis and discussion on the runtime, memory usage and accuracy of our tools, rather than the Notes: Column 'n. unmerged' denotes the number of intervals in the original input file, before we merged non-separated intervals.
biological interpretation of the significance that was already discussed by Sarmashghi and Bafna (2019). Real datasets from Zarrei et al. (2015;Fig. 3b): This study looked at the significance of the overlap between copy number losses in the human genome (23 438 intervals) and all exons from RefSeqannotated human genes (217 527 intervals; Zarrei et al., 2015;Fig. 3b). They further grouped the genes according to properties of interest such as being non-coding or being involved in cancer. Supplementary Table S1 shows the properties of these annotations, and Supplementary Table S4 gives a description of the biological significance of each gene group. We use the copy number loss annotation as the reference and the various gene groups as the query. Unfortunately, a small fraction of the gene names used in the original study could not be located in RefSeq and, for the rest, the gene coordinates in the current RefSeq are possibly different from the time of the original study; therefore, our annotations are not identical.

MCDP is more efficient than SBDP
Synthetic data: Table 2 shows the runtime and memory comparison of MCDP and SBDP on synthetic data. We have varied the size of chromosome as well as the number of intervals in the reference and query, since these are the parameters that appear in the theoretical time complexity of these algorithms. The empirical runtime of MCDP is consistent with the theoretical predictions of Oðm 2 þ nÞ, i.e. it is essentially independent of the chromosome length L and the dependence on the size of the query (n) is dominated by the dependence on the size of the reference (m). The memory usage, on the other hand, seems empirically to be fairly constant, even though the theory predicts it to be OðmÞ. On further investigation, we found that running MCDP on an empty input takes around 130 MB, implying that the heap memory used by MCDP is negligible compared with the baseline Python overhead.
In general, the results show that MCDP can be comfortably used with chromosomes of any length and with all the tested interval annotations. On the other hand, SBDP (without scaling) uses orders of magnitude more running time and memory which quickly becomes prohibitive with an increase in any of the parameters. For example, the analysis of a human-length chromosome (i.e. 10 8 bp) with only 50 reference and query intervals already takes $41 GB of memory and more than 2 h, while MCDP takes <1 min and less than half a GB of RAM. When the number of intervals in either the reference or query annotation is increased, SBDP exceeds the memory capacity of our machine ($100 GB); moreover, the extrapolated runtime (based on the theoretical scaling of OðnmLÞ) is more than 10 days. Note that these results do not invalidate the SBDP approach but rather motivate that a scaling factor usually needs to be applied before running SBDP.
Real datasets from Sarmashghi and Bafna (2019): Table 3 shows the run time and memory usage on the real datasets from Sarmashghi and Bafna (2019). In order to run SBDP, we needed to use a scaling factor no larger than ¼ 10 À3 ; we also tried ¼ 10 À4 , which has a runtime which may be deemed more practical for some of the datasets. We did not use a smaller scaling factor since, as we will show in Section 7.4, the accuracy of SBDP already begins to deteriorate with ¼ 10 À4 .
The running time of MCDP is roughly one order of magnitude less than SBDP with ¼ 10 À4 , and two orders of magnitude less than SBDP with ¼ 10 À3 . For the largest dataset (H3K4me3), MCDP took 11 min, whereas SBDP took 25 (respectively, 2.5) h with ¼ 10 À3 (respectively, 10 À4 ). We conclude that on large datasets, MCDP is faster and more memory efficient than SBDP even after scaling.
7.4 MCDP is more accurate than SBDP with practical scaling Synthetic data: We generated synthetic data by varying the number of intervals in the reference and query and the length of the query intervals; we kept the genome length and the reference interval length constant. Table 4 shows the predicted critical value (i.e. the value of K(R, Q) for which P ¼ 0.05) for MCDP and SBDP with various scaling parameters. For completeness, we also include measurements of Kullback-Leibler divergence and mean square bias (Supplementary Table S2) in the Supplementary Appendix, though the numbers are harder to interpret. To better make summary observations, Table 4 highlights the simulation parameters under which the critical value is at least 10% different from the sampled H GS 0 . We see that MCDP is accurate in all but two of the cases (Lines 25 and 26 in Table 4). These cases belong to the only group (jRj ¼ 2500 and ' q ¼ 1000Þ where a query interval is expected to cover more than one reference interval. This group pushes to the limit the assumption of our underlying Markov chain model, that the query interval lengths are geometrically distributed. As a result, MCDP's PMF is more heavy-tailed than the sampled one ( Supplementary Fig. S1).
When compared with SBDP, MCDP is substantially more accurate at ¼ 10 À3 and even at ¼ 10 À2 . At ¼ 10 À1 , SBDP is more accurate than MCDP, indicating that the improved accuracy of MCDP over SBDP is due to its computational efficiency rather than a more accurate algorithm. Note however that running SBDP with ¼ 10 À1 is infeasible for most human-scale datasets.
Real datasets from Sarmashghi and Bafna (2019): Figure 2 shows the P-values computed by MCDP and SBDP, compared with sampled H GS 0 (Sarmashghi and Bafna, 2019). We use ¼ 10 À4 and ¼ 10 À3 for SBDP, since larger scaling values exceeded our runtime cutoff (4 h) and/or server memory. For the EC dataset, we see that  (-) indicate that the experiment either ran for more than 4 h or exceeded the memory capacity of our machine ($100 GB). MCDP is closer to H GS 0 than SBDP with either scaling factor. For the other three datasets, the P-values are too small to be computed by sampling. To better understand the accuracy in these cases, we plotted the probability mass function generated by all the approaches (Fig. 3). Overall, MCDP produces a distribution with a mode closer to the gold standard than does SBDP. In particular, for ¼ 10 À3 , while the modes of MCDP and of SBDP are both roughly accurate for EC and CNV, the modes of MCDP are better for H3K4me3 and CS. For ¼ 10 À4 , SBDP's modes are substantially shifted for all datasets except EC. Figure 3 also shows the critical value for each dataset. MCDP is always closer to the true value, with an error of at most 5%. SBDP with scaling factor ¼ 10 À3 is slightly worse but still close, whereas SBDP with scaling factor ¼ 10 À4 shows major discrepancies. In particular, there is a 90% error for CS that results in the observed overlap (KðR; QÞ ¼ 344) being mistakenly deemed insignificant at level 0.05.

Large-scale study replication with MCDP
To demonstrate how MCDP enables the analysis of large datasets, we replicated the enrichment/depletion analysis from Zarrei et al. (2015;Fig. 3b). On the whole set of RefSeq exons (217 527 reference intervals), MCDP completes in <17 h (Supplementary Table  S1). Running SBDP with a scaling factor of ¼ 10 À3 exceeded the memory of our server ($100 GB). We did not attempt ¼ 10 À4 since, as we saw in Section 7.4, this gives inaccurate P-values in nearly all our tests. Zarrei et al. (2015) measured the significance of enrichment or depletion using sampling (though the details of the sampling strategy were unclear to us; Fig. 3b) and reported a range for the P-values. Figure 4 shows the P-values obtained by MCDP in comparison with the ranges given by Zarrei et al. (2015). The P-values are generally similar but in three cases MCDP suggests depletion and the original analysis enrichment. For the Protein-coding dataset this is caused by the changes in the underlying annotations, and for the All genes and No phenotype datasets the difference stems from the fact that Zarrei et al. use a different null hypothesis and measure the significance of the number of intersecting nucleotides whereas we look at the number of intersecting intervals. We discuss these cases further in Supplementary Section S2.

Conclusion
In this article, we have studied the problem of computing the significance (i.e. P-value) of the number of overlaps between two genome annotations. We have shown that the gold standard null hypothesis formulation (H GS 0 ), implicitly assumed in other works (e.g. Ernst and Kellis, 2010;Zarrei et al., 2015), leads to an N P-hard problem (Section 3). This motivated us to propose an alternative null hypothesis, based on Markov chains, which relaxes some of the constraints of H GS 0 (Section 5). We have then devised an algorithm MCDP to compute P-values under this hypothesis in time Oðm 2 þ nÞ and space OðmÞ (Section 6). Unlike the previous approach of SBDP, which has OðnmLÞ runtime and OðmLÞ memory, MCDP's resources are independent of the genome length, making it substantially more efficient on human annotations. To evaluate the accuracy of MCDP and SBDP against H GS 0 , we have derived an efficient direct sampling algorithm, which, to the best of our knowledge, is the first such algorithm that does not rely on rejection sampling (Section 4). Our experiments have shown that while SBDP can achieve higher accuracy than MCDP with respect to H GS 0 by setting the scaling parameter closer to 1, doing so can result in prohibitive time and/or memory usage on human-scale datasets (Section 7). Overall, MCDP requires orders of magnitude less time and memory at comparable levels of accuracy.
We see various methodological directions to improve our algorithm. The first major direction is improving the run time of MCDP. Constant speedups can be obtained by porting the code from Python into Cþþ or by parallelizing it. Improving the quadratic dependence on the number of query intervals can potentially be done by employing the fast Fourier transform (Kozen, 1992), though it may negatively affect accuracy (Nagarajan et al., 2005). The second direction is improving the accuracy of MCDP. Our H MC 0 null hypothesis implicitly models the interval lengths by a geometric distribution, which may be suboptimal in some biological contexts. This can be solved within the MC framework using discrete phase-type distributions, which can fit a true distribution of interval lengths at the expense of running time (Isensee and Horton, 2005).
The Markov chain framework we have applied in this article lends itself to efficiently addressing the additional challenges that may arise in biological applications. For example, in the analysis of Zarrei et al. (2015), it might be more insightful to determine whether a specific group of genes has significantly more variants than an average gene, rather than a random subset of the whole genome. More generally, it might make sense to exclude certain regions (e.g. centromeres) from the annotations allowed by the null hypothesis (Kanduri et al., 2019). One could also train different models for certain regions in order to account for a confounding factor (e.g. GC content) and chain these regions together. Another challenge is that in our current framework, we assign the same significance to an overlap of 1 nt as to an overlap of 1000 nt. In situations where this becomes a factor, one could either require a larger minimum overlap length (as done by Sarmashghi and Bafna (2019)) or use different colocalization metrics such as overlap coverage or Euclidean/cosine metrics (Gu et al., 2021). Finally, some applications may require comparing annotations of a graphical pangenome rather than a single linear genome (Rand et al., 2017). Some of these challenges have been addressed in various other frameworks, though not comprehensively and often ad hoc (Kanduri et al., 2019). These challenges could potentially be naturally expressed in the language of Markov chains (by adjusting its structure) or other stochastic models (e.g. random walks), which could serve to unify the existing approaches. We see this bridge between annotation overlap significance and the deeply studied area of stochastic modeling as our main conceptual contribution.   (2019), shown on a negative log scale. On all datasets except EC, the sampled H GS 0 is not shown because every one of the 10 000 samples had less than K(R, Q) overlaps; the rule of three (Burns, 1983) implies that in this case, with 10 000 samples, the true P-value can only be said to be less than P < 3 Â 10 À4 with 95% probability i210 A.Gafurov et al.  Fig. 3b) analysis, where we use copy number losses as a reference and each of the columns as a separate query. The columns correspond to various subset of exons from RefSeq human genes; the exact descriptions are in Supplementary Table S4. The two panels show P-values for enrichment and depletion of copy number losses in the corresponding gene column. The blue dots represent the P-values produced by MCDP, the green boxes represent the Pvalue range reported by Zarrei et al. (2015). We clip the y-axis at the top, and the points and ranges at those boundaries actually have more extreme values. The orange dashed lines (lower) correspond to significance level 0.05, and the red dashed lines (upper) correspond to Bonferroni-adjusted significance level (i.e. 0:05 13 , since there are 13 experiments in this case). Note that MCDP can report the P-value for both enrichment and depletion because it computes the probability mass function for all values of K(R, Q) (A color version of this figure appears in the online version of this article)